Language models (LMs) exhibit striking capabilities on many important tasks, but despite this performance, LM-generated content is often prone to hallucination, factual errors, and harmful output. To address this, many recent works focus on in-place updates to LMs called interventions. These are applied after pretraining (and optional fine-tuning) and aim to update targeted properties of a model without impacting unrelated behaviors or adding excessive compute.
In practice, an LLM may require several such interventions over time, for example to improve inference or memory efficiency, to edit knowledge, to detoxify outputs, or to unlearn specific content. To study this setting, researchers have proposed composable interventions, a framework for studying the effects of applying multiple interventions to the same language model, featuring new metrics and a unified codebase.
When an intervention is applied to a model, it should not interfere with prior or future interventions. For example, quantizing a model should not undo a knowledge edit that was applied earlier. To measure this, the composable interventions framework uses two metrics for composability: 1) Order-free Error, where an intervention is composable if its application leaves others’ success unimpacted, and 2) Order Sensitivity, where the combined success of multiple interventions should not depend on the order in which they are applied.
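To make the two metrics concrete, here is a minimal Python sketch of how they might be computed from success scores. It is an illustrative simplification, not the framework's actual formulas: the function names, the "higher is better" score convention, and the numbers are all assumptions made for this example.

```python
from typing import Dict, Tuple

# Maps an application order, e.g. ("edit", "quantize"), to a success score.
# Assumption: scores lie in [0, 1] and higher means the intervention worked better.
ScoresByOrder = Dict[Tuple[str, ...], float]


def order_sensitivity(scores_by_order: ScoresByOrder) -> float:
    """Spread of the success score across application orders.

    0.0 means the outcome is the same regardless of order.
    """
    values = list(scores_by_order.values())
    return max(values) - min(values)


def order_free_error(score_alone: float, scores_by_order: ScoresByOrder) -> float:
    """Success lost relative to applying the intervention alone.

    Uses the best available order, so the result reflects interference
    that remains even when the order is chosen favorably.
    """
    best = max(scores_by_order.values())
    return max(0.0, score_alone - best)


# Hypothetical edit-success scores, NOT results reported in the source.
edit_alone = 0.95
edit_scores: ScoresByOrder = {
    ("edit", "quantize"): 0.70,  # edit first, then compress
    ("quantize", "edit"): 0.90,  # compress first, then edit
}

sensitivity = order_sensitivity(edit_scores)
error = order_free_error(edit_alone, edit_scores)
print(f"order sensitivity: {sensitivity:.2f}")
print(f"order-free error:  {error:.2f}")
```

In this toy setup, a non-zero order sensitivity indicates that the order of application matters, while a non-zero order-free error indicates that the second intervention degrades the first even under the most favorable order.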
In extensive experiments, the framework was used to compose popular methods from three emerging intervention categories: knowledge editing, model compression, and machine unlearning. Results from 310 different compositions uncover meaningful interactions: compression hinders editing and unlearning, the outcome of composing interventions hinges on their order of application, and popular general-purpose metrics are inadequate for assessing composability. Taken together, the findings reveal clear gaps in composability, suggesting a need for new multi-objective interventions.